{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature: POS/NER Tag Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Derive bag-of-POS-tag and bag-of-NER-tag vectors from each question and calculate their vector distances."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This utility package imports `numpy`, `pandas`, `matplotlib` and a helper `kg` module into the root namespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pygoose import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import cosine, euclidean, jaccard"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Automatically discover the paths to various data folders and compose the project structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "project = kg.Project.discover()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identifier for storing these features on disk and referring to them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_list_id = 'nlp_tags'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Original question datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('')\n",
    "df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Preprocessed and tokenized questions.\n",
    "\n",
    "We should not use lowercased tokens here because that would harm the named entity recognition process."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokens_train = kg.io.load(project.preprocessed_data_dir + 'tokens_spellcheck_train.pickle')\n",
    "tokens_test = kg.io.load(project.preprocessed_data_dir + 'tokens_spellcheck_test.pickle')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_all_texts = pd.DataFrame(\n",
    "    [[' '.join(pair[0]), ' '.join(pair[1])] for pair in tokens_train + tokens_test],\n",
    "    columns=['question1', 'question2'],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dependency parsing takes a lot of time and we don't use any features from it. Let's disable it in the pipeline.\n",
    "\n",
    "If model loading fails, run `python -m spacy download en`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.load('en', parser=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "pos_tags_whitelist = ['ADJ', 'ADV', 'NOUN', 'PROPN', 'NUM', 'VERB']\n",
    "ner_tags_whitelist = ['GPE', 'LOC', 'ORG', 'NORP', 'PERSON', 'PRODUCT', 'DATE', 'TIME', 'QUANTITY', 'CARDINAL']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_raw_features = len(pos_tags_whitelist) + len(ner_tags_whitelist)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "X1 = np.zeros((len(df_all_texts), num_raw_features))\n",
    "X2 = np.zeros((len(df_all_texts), num_raw_features))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((2750086, 16), (2750086, 16))"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X1.shape, X2.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Collect POS and NER tags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipe_q1 = nlp.pipe(df_all_texts['question1'].values, n_threads=os.cpu_count())\n",
    "pipe_q2 = nlp.pipe(df_all_texts['question2'].values, n_threads=os.cpu_count())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 2750086/2750086 [2:43:24<00:00, 280.49it/s]  \n"
     ]
    }
   ],
   "source": [
    "for i, doc in progressbar(enumerate(pipe_q1), total=len(df_all_texts)):\n",
    "    pos_counter = Counter(token.pos_ for token in doc)\n",
    "    ner_counter = Counter(ent.label_ for ent in doc.ents)\n",
    "    X1[i, :] = np.array(\n",
    "        [pos_counter[pos_tag] for pos_tag in pos_tags_whitelist] +\n",
    "        [ner_counter[ner_tag] for ner_tag in ner_tags_whitelist]\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 2750086/2750086 [2:36:32<00:00, 292.79it/s]  \n"
     ]
    }
   ],
   "source": [
    "for i, doc in progressbar(enumerate(pipe_q2), total=len(df_all_texts)):\n",
    "    pos_counter = Counter(token.pos_ for token in doc)\n",
    "    ner_counter = Counter(ent.label_ for ent in doc.ents)\n",
    "    X2[i, :] = np.array(\n",
    "        [pos_counter[pos_tag] for pos_tag in pos_tags_whitelist] +\n",
    "        [ner_counter[ner_tag] for ner_tag in ner_tags_whitelist]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create tag feature sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_pos_q1 = pd.DataFrame(\n",
    "    X1[:, 0:len(pos_tags_whitelist)],\n",
    "    columns=['pos_q1_' + pos_tag.lower() for pos_tag in pos_tags_whitelist]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_pos_q2 = pd.DataFrame(\n",
    "    X2[:, 0:len(pos_tags_whitelist)],\n",
    "    columns=['pos_q2_' + pos_tag.lower() for pos_tag in pos_tags_whitelist]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ner_q1 = pd.DataFrame(\n",
    "    X1[:, -len(ner_tags_whitelist):],\n",
    "    columns=['ner_q1_' + ner_tag.lower() for ner_tag in ner_tags_whitelist]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_ner_q2 = pd.DataFrame(\n",
    "    X2[:, -len(ner_tags_whitelist):],\n",
    "    columns=['ner_q2_' + ner_tag.lower() for ner_tag in ner_tags_whitelist]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compute pairwise distances"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_vector_distances(i):\n",
    "    return [\n",
    "        # POS distances.\n",
    "        cosine(X1[i, 0:len(pos_tags_whitelist)], X2[i, 0:len(pos_tags_whitelist)]),\n",
    "        euclidean(X1[i, 0:len(pos_tags_whitelist)], X2[i, 0:len(pos_tags_whitelist)]),\n",
    "\n",
    "        # NER distances.\n",
    "        euclidean(X1[i, -len(ner_tags_whitelist):], X2[i, -len(ner_tags_whitelist):]),\n",
    "        np.abs(np.sum(X1[i, -len(ner_tags_whitelist):]) - np.sum(X2[i, -len(ner_tags_whitelist):])),\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████| 2751/2751 [06:39<00:00,  6.89it/s]\n"
     ]
    }
   ],
   "source": [
    "warnings.filterwarnings('ignore')\n",
    "X_distances = kg.jobs.map_batch_parallel(\n",
    "    list(range(len(df_all_texts))),\n",
    "    item_mapper=get_vector_distances,\n",
    "    batch_size=1000,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_distances = np.array(X_distances)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_distances = pd.DataFrame(\n",
    "    X_distances,\n",
    "    columns=[\n",
    "        'pos_tag_cosine',\n",
    "        'pos_tag_euclidean',\n",
    "        'ner_tag_euclidean',\n",
    "        'ner_tag_count_diff',\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build master feature list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_master = pd.concat(\n",
    "    [df_pos_q1, df_ner_q1, df_pos_q2, df_ner_q2, df_distances],\n",
    "    axis=1,\n",
    "    ignore_index=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_master.columns = list(df_pos_q1.columns) + \\\n",
    "    list(df_ner_q1.columns) + \\\n",
    "    list(df_pos_q2.columns) + \\\n",
    "    list(df_ner_q2.columns) + \\\n",
    "    list(df_distances.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>mean</th>\n",
       "      <th>std</th>\n",
       "      <th>min</th>\n",
       "      <th>25%</th>\n",
       "      <th>50%</th>\n",
       "      <th>75%</th>\n",
       "      <th>max</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>pos_q1_adj</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>1.102526</td>\n",
       "      <td>1.094043</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>26.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q1_adv</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.730120</td>\n",
       "      <td>0.865068</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>17.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q1_noun</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>2.878714</td>\n",
       "      <td>1.826381</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>42.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q1_propn</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.859963</td>\n",
       "      <td>1.320762</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>37.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q1_num</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.453513</td>\n",
       "      <td>1.477596</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>83.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q1_verb</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>2.375678</td>\n",
       "      <td>1.561820</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>62.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_gpe</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.150513</td>\n",
       "      <td>0.427541</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>10.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_loc</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.013706</td>\n",
       "      <td>0.125876</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_org</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.223405</td>\n",
       "      <td>0.503474</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_norp</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.052355</td>\n",
       "      <td>0.260447</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>7.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_person</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.097468</td>\n",
       "      <td>0.342881</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>6.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_product</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.013324</td>\n",
       "      <td>0.119395</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_date</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.049494</td>\n",
       "      <td>0.238229</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_time</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.015133</td>\n",
       "      <td>0.131363</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>5.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_quantity</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.027074</td>\n",
       "      <td>0.196348</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>10.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q1_cardinal</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.205036</td>\n",
       "      <td>0.833744</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>71.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q2_adj</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>1.105541</td>\n",
       "      <td>1.103751</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>26.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q2_adv</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.735044</td>\n",
       "      <td>0.872221</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>17.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q2_noun</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>2.872393</td>\n",
       "      <td>1.848079</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>4.000000</td>\n",
       "      <td>42.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q2_propn</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.859091</td>\n",
       "      <td>1.318321</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>37.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q2_num</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.458988</td>\n",
       "      <td>1.477077</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>83.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_q2_verb</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>2.402303</td>\n",
       "      <td>1.618294</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>62.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_gpe</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.151756</td>\n",
       "      <td>0.430246</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>10.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_loc</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.013766</td>\n",
       "      <td>0.126354</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_org</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.223277</td>\n",
       "      <td>0.504502</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_norp</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.052142</td>\n",
       "      <td>0.260487</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_person</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.096556</td>\n",
       "      <td>0.340550</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_product</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.013484</td>\n",
       "      <td>0.120184</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_date</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.051031</td>\n",
       "      <td>0.242660</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>9.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_time</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.015104</td>\n",
       "      <td>0.131250</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>6.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_quantity</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.027066</td>\n",
       "      <td>0.195606</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>11.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_q2_cardinal</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.206940</td>\n",
       "      <td>0.836230</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>71.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_tag_cosine</th>\n",
       "      <td>2749317.000000</td>\n",
       "      <td>0.172357</td>\n",
       "      <td>0.164213</td>\n",
       "      <td>-0.000000</td>\n",
       "      <td>0.054095</td>\n",
       "      <td>0.121690</td>\n",
       "      <td>0.239361</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>pos_tag_euclidean</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>3.110262</td>\n",
       "      <td>2.099757</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.732051</td>\n",
       "      <td>2.645751</td>\n",
       "      <td>3.872983</td>\n",
       "      <td>81.043198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_tag_euclidean</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.758212</td>\n",
       "      <td>1.064286</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>69.028979</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ner_tag_count_diff</th>\n",
       "      <td>2750086.000000</td>\n",
       "      <td>0.670950</td>\n",
       "      <td>1.152577</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>71.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                            count     mean      std       min      25%  \\\n",
       "pos_q1_adj         2750086.000000 1.102526 1.094043  0.000000 0.000000   \n",
       "pos_q1_adv         2750086.000000 0.730120 0.865068  0.000000 0.000000   \n",
       "pos_q1_noun        2750086.000000 2.878714 1.826381  0.000000 2.000000   \n",
       "pos_q1_propn       2750086.000000 0.859963 1.320762  0.000000 0.000000   \n",
       "pos_q1_num         2750086.000000 0.453513 1.477596  0.000000 0.000000   \n",
       "pos_q1_verb        2750086.000000 2.375678 1.561820  0.000000 1.000000   \n",
       "ner_q1_gpe         2750086.000000 0.150513 0.427541  0.000000 0.000000   \n",
       "ner_q1_loc         2750086.000000 0.013706 0.125876  0.000000 0.000000   \n",
       "ner_q1_org         2750086.000000 0.223405 0.503474  0.000000 0.000000   \n",
       "ner_q1_norp        2750086.000000 0.052355 0.260447  0.000000 0.000000   \n",
       "ner_q1_person      2750086.000000 0.097468 0.342881  0.000000 0.000000   \n",
       "ner_q1_product     2750086.000000 0.013324 0.119395  0.000000 0.000000   \n",
       "ner_q1_date        2750086.000000 0.049494 0.238229  0.000000 0.000000   \n",
       "ner_q1_time        2750086.000000 0.015133 0.131363  0.000000 0.000000   \n",
       "ner_q1_quantity    2750086.000000 0.027074 0.196348  0.000000 0.000000   \n",
       "ner_q1_cardinal    2750086.000000 0.205036 0.833744  0.000000 0.000000   \n",
       "pos_q2_adj         2750086.000000 1.105541 1.103751  0.000000 0.000000   \n",
       "pos_q2_adv         2750086.000000 0.735044 0.872221  0.000000 0.000000   \n",
       "pos_q2_noun        2750086.000000 2.872393 1.848079  0.000000 2.000000   \n",
       "pos_q2_propn       2750086.000000 0.859091 1.318321  0.000000 0.000000   \n",
       "pos_q2_num         2750086.000000 0.458988 1.477077  0.000000 0.000000   \n",
       "pos_q2_verb        2750086.000000 2.402303 1.618294  0.000000 1.000000   \n",
       "ner_q2_gpe         2750086.000000 0.151756 0.430246  0.000000 0.000000   \n",
       "ner_q2_loc         2750086.000000 0.013766 0.126354  0.000000 0.000000   \n",
       "ner_q2_org         2750086.000000 0.223277 0.504502  0.000000 0.000000   \n",
       "ner_q2_norp        2750086.000000 0.052142 0.260487  0.000000 0.000000   \n",
       "ner_q2_person      2750086.000000 0.096556 0.340550  0.000000 0.000000   \n",
       "ner_q2_product     2750086.000000 0.013484 0.120184  0.000000 0.000000   \n",
       "ner_q2_date        2750086.000000 0.051031 0.242660  0.000000 0.000000   \n",
       "ner_q2_time        2750086.000000 0.015104 0.131250  0.000000 0.000000   \n",
       "ner_q2_quantity    2750086.000000 0.027066 0.195606  0.000000 0.000000   \n",
       "ner_q2_cardinal    2750086.000000 0.206940 0.836230  0.000000 0.000000   \n",
       "pos_tag_cosine     2749317.000000 0.172357 0.164213 -0.000000 0.054095   \n",
       "pos_tag_euclidean  2750086.000000 3.110262 2.099757  0.000000 1.732051   \n",
       "ner_tag_euclidean  2750086.000000 0.758212 1.064286  0.000000 0.000000   \n",
       "ner_tag_count_diff 2750086.000000 0.670950 1.152577  0.000000 0.000000   \n",
       "\n",
       "                        50%      75%       max  \n",
       "pos_q1_adj         1.000000 2.000000 26.000000  \n",
       "pos_q1_adv         1.000000 1.000000 17.000000  \n",
       "pos_q1_noun        3.000000 4.000000 42.000000  \n",
       "pos_q1_propn       0.000000 1.000000 37.000000  \n",
       "pos_q1_num         0.000000 0.000000 83.000000  \n",
       "pos_q1_verb        2.000000 3.000000 62.000000  \n",
       "ner_q1_gpe         0.000000 0.000000 10.000000  \n",
       "ner_q1_loc         0.000000 0.000000  4.000000  \n",
       "ner_q1_org         0.000000 0.000000  8.000000  \n",
       "ner_q1_norp        0.000000 0.000000  7.000000  \n",
       "ner_q1_person      0.000000 0.000000  6.000000  \n",
       "ner_q1_product     0.000000 0.000000  4.000000  \n",
       "ner_q1_date        0.000000 0.000000  8.000000  \n",
       "ner_q1_time        0.000000 0.000000  5.000000  \n",
       "ner_q1_quantity    0.000000 0.000000 10.000000  \n",
       "ner_q1_cardinal    0.000000 0.000000 71.000000  \n",
       "pos_q2_adj         1.000000 2.000000 26.000000  \n",
       "pos_q2_adv         1.000000 1.000000 17.000000  \n",
       "pos_q2_noun        3.000000 4.000000 42.000000  \n",
       "pos_q2_propn       0.000000 1.000000 37.000000  \n",
       "pos_q2_num         0.000000 0.000000 83.000000  \n",
       "pos_q2_verb        2.000000 3.000000 62.000000  \n",
       "ner_q2_gpe         0.000000 0.000000 10.000000  \n",
       "ner_q2_loc         0.000000 0.000000  4.000000  \n",
       "ner_q2_org         0.000000 0.000000  8.000000  \n",
       "ner_q2_norp        0.000000 0.000000  8.000000  \n",
       "ner_q2_person      0.000000 0.000000  8.000000  \n",
       "ner_q2_product     0.000000 0.000000  4.000000  \n",
       "ner_q2_date        0.000000 0.000000  9.000000  \n",
       "ner_q2_time        0.000000 0.000000  6.000000  \n",
       "ner_q2_quantity    0.000000 0.000000 11.000000  \n",
       "ner_q2_cardinal    0.000000 0.000000 71.000000  \n",
       "pos_tag_cosine     0.121690 0.239361  1.000000  \n",
       "pos_tag_euclidean  2.645751 3.872983 81.043198  \n",
       "ner_tag_euclidean  0.000000 1.000000 69.028979  \n",
       "ner_tag_count_diff 0.000000 1.000000 71.000000  "
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_master.describe().T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = df_master[:len(tokens_train)].values\n",
    "X_test = df_master[len(tokens_train):].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "X train: (404290, 36)\n",
      "X test:  (2345796, 36)\n"
     ]
    }
   ],
   "source": [
    "print('X train:', X_train.shape)\n",
    "print('X test: ', X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_names = list(df_master.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "project.save_features(X_train, X_test, feature_names, feature_list_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
